Vintern-3B-R-beta is a multimodal large language model for complex image-based reasoning. It can decompose a problem into reasoning steps and is designed to keep hallucinations under control.
Image-to-Text
Transformers
Supports Multiple Languages